Intrinsic noise in game dynamical learning 
Demographic noise has profound effects on evolutionary and population dynamics, as well as
on chemical reaction systems and models of epidemiology. Such noise is intrinsic and due to the
discreteness of the dynamics in finite populations. We here show that similar noise-sustained
trajectories arise in game dynamical learning, where the stochasticity has a different origin:
agents sample a finite number of moves of their opponents in between adaptation events. The
limit of infinite batches results in deterministic modified replicator equations, whereas finite
sampling leads to a stochastic dynamics. The characteristics of these fluctuations can be computed
analytically using methods from statistical physics, and such noise can affect the attractors
significantly, leading to noise-sustained cycling or removing periodic orbits of the standard
replicator dynamics.
Intrinsic noise has been seen to have significant effects on dynamical systems, and may alter
their attractors substantially. Noise-sustained oscillations, generated via an amplification
mechanism, are for example present in models of population dynamics [1], epidemiology [2] or
biochemical reaction systems [3]. The origin of these fluctuations is the discreteness of the
dynamics in finite systems; deterministic descriptions are then no longer appropriate. The class
of systems in which intrinsic noise cannot be neglected includes models of evolutionary dynamics
and game theory, and much current research aims at understanding the effects of this demographic
stochasticity using methods from nonequilibrium statistical mechanics and the theory of
stochastic processes [4].

Here, we will focus on intrinsic noise resulting from a different origin, and will consider the
learning dynamics of agents in a game theoretic setting [5]. This is complementary to more
conventional approaches to game theory concentrating on the characterisation of equilibrium
points [6], or on evolutionary processes [7]. In the learning scenario one considers a small
number of agents who interact repeatedly in a given game, and who observe their opponents'
actions and aim to react by adapting their own strategy profile. Such dynamical models are of
particular importance for the understanding of experiments in game theory and behavioral
economics, in which human subjects play a given game repeatedly under controlled conditions
[8, 9]. As a key result we show that stochasticity, induced by imperfect sampling of the
opponents' strategy profiles, can result in trajectories quite different from those of
deterministic learning, very much akin to the mechanism by which intrinsic noise in finite
populations affects the trajectories of evolutionary systems. While the amount of intrinsic
noise in evolutionary dynamics is determined by the number of individuals in the population,
our objective here is to characterise the fluctuations in the learning dynamics of two fixed
agents. The quantity controlling the noise strength is the number of observations made by the
agents in between adaptation events. Furthermore, in a deterministic setting and depending on
the game, we demonstrate that memory loss can promote or impede convergence to a Nash
equilibrium.

Consider a general symmetric two-player game, played repeatedly by players X and Y, and assume
there are $p$ pure strategies in this game. The payoff matrix is given by $a_{ij}$, where
$i, j \in \{1, \dots, p\}$. The rounds of the repeated interaction will be labeled by
$t = 1, 2, \dots$ in the following. In each round player X plays one pure strategy
$i(t) \in \{1, \dots, p\}$, and player Y plays $j(t) \in \{1, \dots, p\}$. The payoff for X is
then $a_{i(t)j(t)}$ and that for Y is $a_{j(t)i(t)}$. If the players play stochastically, i.e.
if they resort to mixed strategies, $i(t)$ and $j(t)$ will be random variables. Assuming that
player X carries a (time-dependent) mixed strategy profile
$\mathbf{x}(t) = (x_1(t), \dots, x_p(t))$ and similarly
$\mathbf{y}(t) = (y_1(t), \dots, y_p(t))$ for player Y, a learning dynamics is then a
prescription used to update these strategy profiles between subsequent rounds of the game.
$x_i(t)$ here denotes the probability with which player X plays pure strategy
$i \in \{1, \dots, p\}$ in round $t$, and similarly for $y_j(t)$. Normalization requires
$\sum_{i=1}^{p} x_i(t) = \sum_{j=1}^{p} y_j(t) = 1$.

In order to define a specific learning dynamics, we follow [9], and assume that each player
keeps valuations of each pure strategy, measuring their relative performance in the past. More
precisely, in a situation without memory loss, the valuation $q_i(t)$ player X has for pure
strategy $i$ is the total payoff X would have obtained, had he/she always played strategy $i$
up to time $t$, and given Y's actions. The valuation $r_j(t)$ player Y has for $j$ has an
analogous meaning. Following [10] players then use a logit rule

$$x_i(t) = \frac{e^{\Gamma q_i(t)}}{\sum_k e^{\Gamma q_k(t)}}, \qquad
y_j(t) = \frac{e^{\Gamma r_j(t)}}{\sum_k e^{\Gamma r_k(t)}}. \tag{1}$$

$\Gamma \geq 0$ here sets the scale of the score valuations, and is known as the response
sensitivity [9]. While $\Gamma = 0$ corresponds to random response, and $\Gamma = \infty$ to
deterministic play, we will here focus on the case in which $0 < \Gamma < \infty$. It is
important to distinguish between two types of randomness in the actual play. First, as
prescribed by Eq. (1), the players will generally use mixed strategies, so that their actions
can be stochastic, even at given strategy valuations. Secondly, the update of the valuations
itself will contain some stochasticity as we will detail next. We will here assume that players
update their scores only once every $N$ rounds of the game, and keep them constant in between.
This is known as batch learning in computer science [12].
Specifically, we will assume

$$\begin{aligned}
q_k(t+N) &= (1-\lambda)\, q_k(t) + \frac{1}{N} \sum_{t'=t}^{t+N-1} a_{k\,j(t')}, \\
r_k(t+N) &= (1-\lambda)\, r_k(t) + \frac{1}{N} \sum_{t'=t}^{t+N-1} a_{k\,i(t')},
\end{aligned} \tag{2}$$

and $q_k(t+\tau) = q_k(t)$ for all $\tau = 1, 2, \dots, N-1$, and similarly for player Y.
On-line learning [12], i.e. updating after each round, is recovered for $N = 1$. In our model
all $\{q_i, r_j\}$ are updated at each adaptation event. This corresponds to reinforcement
learning in which foregone payoffs are known and reinforced, equivalent to weighted fictitious
play belief learning, see Ho et al. [9]. The interpretation of these update rules is understood
best by first considering the case $\lambda = 0$: then the increment of $q_k$ between
time-steps $t$ and $t+N$ is given by $N^{-1} \sum_{t'=t}^{t+N-1} a_{k\,j(t')}$. This increment
is recognized as the average payoff X would have received per round had he/she played pure
strategy $k$ in all rounds $t, t+1, \dots, t+N-1$. A non-zero value, $\lambda \in (0, 1]$,
accounts for memory loss. We here note that other approaches can be taken to describe memory
loss, for example one may introduce a prefactor $\lambda$ in the payoff terms in Eq. (2). In
this paper we follow the setup of [10].

The update rules are intrinsically stochastic; we will refer to Eqs. (1, 2) as discrete-time
stochastic learning (DTSL). After a re-scaling of time, and for large, but finite batch size
$N$ we can write

$$\begin{aligned}
q_k(\ell+1) &= (1-\lambda)\, q_k(\ell) + \sum_j a_{kj}\, y_j(\ell) + \frac{1}{\sqrt{N}}\, \zeta_k(\ell), \\
r_k(\ell+1) &= (1-\lambda)\, r_k(\ell) + \sum_i a_{ki}\, x_i(\ell) + \frac{1}{\sqrt{N}}\, \eta_k(\ell),
\end{aligned} \tag{3}$$

where we approximate the noise variables $\zeta_k, \eta_k$ as Gaussian random variables. This
amounts to an expansion in $N^{-1/2}$, and within this approximation the covariances of the
$\zeta_k, \eta_k$ can be obtained, as we will report elsewhere [14]. In the limit of infinite
batch size, $N \to \infty$, the dynamics becomes deterministic; we will refer to this as
discrete-time deterministic learning (DTDL). Assuming $\Gamma \ll 1$ a continuous-time limit
[10] leads to the modified replicator equations,

$$\begin{aligned}
\frac{\dot{x}_i}{x_i} &= \Gamma \sum_j a_{ij}\, y_j - \Gamma f[\mathbf{x}, \mathbf{y}] + \lambda \sum_k x_k \ln \frac{x_k}{x_i}, \\
\frac{\dot{y}_j}{y_j} &= \Gamma \sum_i a_{ji}\, x_i - \Gamma f[\mathbf{y}, \mathbf{x}] + \lambda \sum_k y_k \ln \frac{y_k}{y_j},
\end{aligned} \tag{4}$$

where $f[\mathbf{x}, \mathbf{y}] = \sum_{ij} a_{ij}\, x_i y_j$, as previously reported and
studied in [10], see also [11]. This system maintains the normalisation of probabilities, and
is hence $2(p-1)$-dimensional. DTDL gives rise to a discrete version of Eqs. (4). For DTSL the
map is supplemented by noise. We will denote fixed points of the noiseless map by
$\mathbf{z}^* = (x_1^*, \dots, x_p^*, y_1^*, \dots, y_p^*)$; they are identical to the fixed
points of Eqs. (4). We now perform an expansion about the fixed point in powers of $N^{-1/2}$,
akin to the expansion first proposed in [13]. Writing
$\mathbf{z}(\ell) = \mathbf{z}^* + N^{-1/2} \mathbf{A}(\ell)$, one finds

$$\mathbf{A}(\ell+1) = J\, \mathbf{A}(\ell) + \boldsymbol{\zeta}(\ell), \tag{5}$$



with $J$ the Jacobian at the fixed point, and where $\boldsymbol{\zeta}(\ell)$ is Gaussian
white noise, with correlations among its components, which can be worked out analytically
[14]. Eq. (5) is the discrete-time analogue of a linear Langevin equation, and the starting
point for the analysis of fluctuations about the deterministic limit. In particular Eq. (5)
allows one to compute the stationary distributions of the components of $\mathbf{A}$, as well
as their temporal correlations and power spectra
$P_i(\omega) = \langle |\hat{A}_i(\omega)|^2 \rangle$, with $\hat{A}_i(\omega)$ the Fourier
transform of $A_i(\ell)$ [14]. This follows the lines of [1]. Here
we will illustrate the effects noise has on the learning 
dynamics using the two examples of the prisoners' 
dilemma, and that of the rock-papers-scissors game. 
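
Before turning to the examples, the update rules of Eqs. (1) and (2) and their infinite-batch
limit can be sketched in a few lines of Python. This is a minimal illustration under stated
assumptions, not the code used for the figures: the function names `logit`, `dtsl_step` and
`dtdl_step` are hypothetical, `gamma` and `lam` stand for the response sensitivity and the
memory-loss rate, `a` is the payoff matrix as a nested list, and one call advances the
valuations by one adaptation event (a batch of `n_batch` rounds).

```python
import math
import random


def logit(values, gamma):
    """Eq. (1): mixed strategy obtained from valuations via the logit rule."""
    m = max(values)  # subtract the maximum for numerical stability
    w = [math.exp(gamma * (v - m)) for v in values]
    s = sum(w)
    return [wi / s for wi in w]


def dtsl_step(q, r, a, gamma, lam, n_batch, rng):
    """Eq. (2): play n_batch rounds at fixed valuations, then update them."""
    p = len(q)
    x, y = logit(q, gamma), logit(r, gamma)
    # Actions i(t'), j(t') actually played by X and Y during the batch.
    i_moves = rng.choices(range(p), weights=x, k=n_batch)
    j_moves = rng.choices(range(p), weights=y, k=n_batch)
    q_new = [(1 - lam) * q[k] + sum(a[k][j] for j in j_moves) / n_batch
             for k in range(p)]
    r_new = [(1 - lam) * r[k] + sum(a[k][i] for i in i_moves) / n_batch
             for k in range(p)]
    return q_new, r_new


def dtdl_step(q, r, a, gamma, lam):
    """Infinite-batch limit: the sampled average is replaced by its mean."""
    p = len(q)
    x, y = logit(q, gamma), logit(r, gamma)
    q_new = [(1 - lam) * q[k] + sum(a[k][j] * y[j] for j in range(p))
             for k in range(p)]
    r_new = [(1 - lam) * r[k] + sum(a[k][i] * x[i] for i in range(p))
             for k in range(p)]
    return q_new, r_new
```

Iterating `dtdl_step` from identical valuations keeps the two players' profiles identical,
whereas `dtsl_step` with a small `n_batch` scatters the valuations around the deterministic
update, with deviations that shrink as the batch grows.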

The prisoner's dilemma describes a problem of mutual cooperation, where two players each face
the choice whether to co-operate (C) or to defect (D). We will here choose the payoff matrix
$a_{CC} = 3$, $a_{CD} = 0$, $a_{DC} = 5$, $a_{DD} = 1$. The Nash equilibrium, and fixed point
of the standard replicator dynamics ($\lambda = 0$), is defection, and we will in the following
discuss the outcome of the batch and on-line learning dynamics with and without memory loss.
As seen in Fig. 1a, the deterministic learning dynamics converges to a fixed point; a numerical
analysis shows that this fixed point is symmetric with respect to the exchange of players
($\mathbf{x}^* = \mathbf{y}^*$). The defection rate of either player decreases with increasing
memory loss (Fig. 1b). The fixed point of Eqs. (4) depends only on the ratio $\lambda/\Gamma$,
and the different curves in Fig. 1b can be collapsed. The learning dynamics at finite batch
size and $\lambda > 0$ yields noisy trajectories fluctuating about the deterministic mean
(Fig. 1c); averaging the noisy dynamics over independent runs reproduces the deterministic
trajectory (Fig. 1a). In Fig. 2 we address the nature of stochastic fluctuations in more
detail.

FIG. 1: (Color on-line). Defection rate in the prisoners' dilemma. (a) Dynamics at
$\Gamma = 0.5$, $\lambda = 0, 0.25, 0.5, 0.75$ (top to bottom). Markers are from simulations of
DTSL ($N = 10$, averaged over 1000 runs, defection rate shown for one fixed player), lines from
DTDL; (b) Defection rate as a function of the memory-loss rate $\lambda$ for
$\Gamma = 1, 0.5, 0.1$ (top to bottom); (c) Single runs of the DTSL dynamics at $N = 10$,
parameters as in (a).

While deterministic learning converges towards a mixed strategy fixed point, learning at finite
batch sizes leads to a distribution of mixed strategy vectors as indicated in Fig. 2a. The
width of these distributions scales as $N^{-1/2}$, and can be obtained from the theory to great
accuracy. Fig. 2b demonstrates that our analytical approach captures spectral properties of the
fluctuations as well, and again near perfect agreement between theory and simulations is found.
These results show that the expansion in the inverse batch size is a viable analytical tool for
the characterization of stochastic effects in game dynamical learning, and we will proceed to
apply it to a second matrix game in the following.
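
The dependence of the prisoner's dilemma fixed point on the ratio $\lambda/\Gamma$ alone can be
made explicit with a short calculation. The following is a sketch that assumes the symmetric
fixed point $\mathbf{x}^* = \mathbf{y}^*$ found numerically, and writes $d$ for the stationary
defection probability (a symbol introduced here for illustration only). Requiring the
valuations to be stationary under the noiseless update gives
$\lambda q_k^* = \sum_j a_{kj}\, y_j^*$, and inserting this into the logit rule yields

```latex
x_i^* = \frac{\exp\!\big[(\Gamma/\lambda) \sum_j a_{ij}\, y_j^*\big]}
             {\sum_k \exp\!\big[(\Gamma/\lambda) \sum_j a_{kj}\, y_j^*\big]},
\qquad \text{so that} \qquad
d = \frac{1}{1 + \exp\!\big[-(\Gamma/\lambda)\,(2 - d)\big]},
```

where $2 - d = (a_{DC} - a_{CC})(1 - d) + (a_{DD} - a_{CD})\, d$ is the payoff advantage of
defecting against an opponent who defects with probability $d$. Only the combination
$\Gamma/\lambda$ appears. For $\lambda/\Gamma \to 0$ the solution approaches $d = 1$, the Nash
equilibrium, and for $\lambda/\Gamma \to \infty$ it approaches $d = 1/2$, in line with the
decrease of the defection rate with increasing memory loss.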

Rock-papers-scissors (RPS) is a game with $p = 3$ strategies and cyclic dominance, as indicated
by the payoff matrix $a_{RS} = a_{SP} = a_{PR} = 1$, $a_{SR} = a_{PS} = a_{RP} = -1$ and
$a_{RR} = a_{PP} = a_{SS} = 0$. If the system is started from symmetric initial conditions,
$(x_R, x_P, x_S) = (y_R, y_P, y_S)$, the continuous-time replicator dynamics, Eqs. (4) at
$\lambda = 0$, reduces to a one-population dynamics, which has one neutrally stable fixed point
at $x_R^* = x_P^* = x_S^* = 1/3$, with closed periodic orbits surrounding it [15]. The quantity
$H = -\ln(x_R x_P x_S) - 3 \ln 3$ is a constant of motion [15], which vanishes at the neutrally
stable fixed point, and provides a measure of distance from this fixed point. The symmetry
between the two players can be broken as discussed in [10], giving rise to the possibility of
limit cycles and chaotic motion, which we do not discuss here.

FIG. 2: (Color on-line). Defectors in the prisoners' dilemma. (a) Distribution of defection
rates at $\Gamma = \lambda = 0.5$, $N = 1000, 100, 10$ from top to bottom at the peak;
(b) Spectrum of fluctuations of defection rate. Symbols from simulations in both panels, solid
lines from theory.

We first investigate the case without memory loss in Fig. 3. The discrete-time learning
dynamics at infinite and at finite batch sizes does not proceed along the cycles of the
continuous-time replicator dynamics, but instead it drifts towards the edges of the strategy
simplex. Fig. 3 shows the distance $H$ from the center. This distance increases monotonically,
so that the learning dynamics operates mostly at the borders of the strategy simplex after some
transient time. In the deterministic case this effect is due to the discreteness in time of the
learning process: the relevant eigenvalues of the map at the central fixed point are given by
$1 - \lambda \pm i\Gamma/\sqrt{3}$, so that the fixed point is unstable for
$\lambda < \lambda_c(\Gamma) = 1 - \sqrt{1 - \Gamma^2/3}$, and stable for
$\lambda > \lambda_c$. In the unstable regime fluctuations due to finite batch sizes enhance
the outwards drift.
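
The stability threshold can be checked numerically by iterating the noiseless map for RPS on
either side of $\lambda_c$. The sketch below is an illustration under stated assumptions: the
helper names (`logit`, `distance_h`, `lambda_c`, `h_after`), the initial valuations `q0` and
the number of iterations are choices made here, not taken from the paper. The threshold itself
follows from requiring $|1 - \lambda \pm i\Gamma/\sqrt{3}| = 1$.

```python
import math

# RPS payoffs with rows and columns ordered (R, P, S): a_RS = a_SP = a_PR = 1.
A_RPS = [[0, -1, 1],
         [1, 0, -1],
         [-1, 1, 0]]


def logit(values, gamma):
    """Logit rule mapping valuations to a mixed strategy."""
    m = max(values)  # subtract the maximum for numerical stability
    w = [math.exp(gamma * (v - m)) for v in values]
    s = sum(w)
    return [wi / s for wi in w]


def distance_h(x):
    """H = -ln(x_R x_P x_S) - 3 ln 3, which vanishes at the simplex centre."""
    return -sum(math.log(xi) for xi in x) - 3.0 * math.log(3.0)


def lambda_c(gamma):
    """Memory-loss rate at which the relevant eigenvalues have modulus one."""
    return 1.0 - math.sqrt(1.0 - gamma ** 2 / 3.0)


def h_after(gamma, lam, steps, q0):
    """Iterate the noiseless map from symmetric valuations q = r = q0."""
    q, r = list(q0), list(q0)
    for _ in range(steps):
        x, y = logit(q, gamma), logit(r, gamma)
        q_new = [(1 - lam) * q[k] + sum(A_RPS[k][j] * y[j] for j in range(3))
                 for k in range(3)]
        r_new = [(1 - lam) * r[k] + sum(A_RPS[k][i] * x[i] for i in range(3))
                 for k in range(3)]
        q, r = q_new, r_new
    return distance_h(logit(q, gamma))


gamma = 0.1
q0 = [0.5, 0.0, -0.5]                      # small displacement from the centre
h0 = distance_h(logit(q0, gamma))
h_unstable = h_after(gamma, 0.0, 500, q0)  # lambda = 0 < lambda_c
h_stable = h_after(gamma, 0.01, 500, q0)   # lambda = 0.01 > lambda_c
print(lambda_c(gamma), h0, h_unstable, h_stable)
```

With these parameters the distance from the centre grows without memory loss and shrinks at
$\lambda = 0.01$, which lies above the threshold for $\Gamma = 0.1$.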

The differences between the noise-free learning process and on-line adaptation for the case
$\lambda > \lambda_c$ are studied in Fig. 4. Here the fixed point of the DTDL dynamics is
stable. The eigenvalues of the Jacobian $J$ at the fixed point are complex, and hence a
resonant amplification of fluctuations is possible, similar to the enhanced demographic
fluctuations reported in [1]. Indeed, Fig. 4 shows that the stochastic learning dynamics at
finite batch size sustains coherent stochastic oscillations about the deterministic fixed
point. Their power spectrum can be computed based on an analysis of Eq. (5). Results are
compared with simulations in Fig. 4d, and as seen the agreement is excellent, provided the
batch size is large enough to justify the expansion in $N^{-1/2}$. Fig. 4 shows that this is
the case even for small batch sizes; for other games this will most likely depend on the number
of strategies available to the players. These phenomena are dynamically similar to those in
evolutionary systems, where a linear scaling of extinction times in the system size has been
reported for neutrally stable dynamics [4]. In the learning system there is no extinction, but
escape times from a region around the fixed point can be measured [14], and a similar linear
scaling in the batch size is found for the neutrally stable case $\lambda = \lambda_c$. In the
stable phase escape is sub-extensive, in the unstable regime escape times grow faster than
linearly in $N$, very akin to what is reported in [4].

FIG. 3: (Color on-line). Rock-papers-scissors without memory loss ($\lambda = 0$,
$\Gamma = 0.1$). Main panel shows the distance $H$ from the center of the simplex versus time.
Solid line is the DTDL dynamics, markers from DTSL at finite batch size (averages over 1000
runs). The inset shows the frequency of one of the pure strategies versus time for DTDL and for
one run of DTSL, and illustrates the drift towards the edges of the strategy simplex.

FIG. 4: (Color on-line). Rock-papers-scissors at $\lambda = 0.01$, $\Gamma = 0.1$.
(a) Distance $H$ versus time; (b) deterministic and stochastic trajectories ($N = 10$) in the
strategy simplex; (c) probability of playing rock for the same run as in (b); (d) power spectra
of fluctuations for $N = 1, 2, 3, 10$ compared to theory.
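
The power spectra compared with simulations in Fig. 4d follow from Eq. (5) by a discrete
Fourier transform. As a sketch, assume that the noise has zero mean and covariance
$\langle \zeta_a(\ell)\, \zeta_b(\ell') \rangle = D_{ab}\, \delta_{\ell\ell'}$, where the
matrix $D$ is a placeholder introduced here (its explicit form is deferred to [14]), and use
the convention $\hat{A}_a(\omega) = \sum_\ell e^{-i\omega\ell} A_a(\ell)$. Then

```latex
\big(e^{i\omega}\, \mathbf{1} - J\big)\, \hat{\mathbf{A}}(\omega) = \hat{\boldsymbol{\zeta}}(\omega)
\quad \Longrightarrow \quad
P_a(\omega) \propto \Big[ \big(e^{i\omega}\, \mathbf{1} - J\big)^{-1} D\,
\big(e^{-i\omega}\, \mathbf{1} - J^{T}\big)^{-1} \Big]_{aa},
```

up to a normalisation constant that depends on the Fourier convention. Complex eigenvalues of
$J$ with modulus close to one make $e^{i\omega}\, \mathbf{1} - J$ nearly singular at a non-zero
frequency, which produces the spectral peak associated with the amplified, coherent
oscillations.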



Fluctuations in finite populations have profound consequences in evolutionary game theory, and
we have here shown that similar stochastic effects can be seen in a learning-theoretic
scenario. The source of noise is different from that in evolutionary systems, and the analogue
of finite populations is the finite batch of observations which players make in between
adaptation events. Our analysis demonstrates that memory loss can lead the system away from
Nash equilibria and bring about co-operation in social dilemmas. In cyclic games such as RPS
convergence is only possible with sufficient memory loss; the center of the strategy simplex
then becomes a stable fixed point for deterministic learning. The stochasticity and
discreteness in the adaptation dynamics can affect the asymptotic attractors considerably, and
noise-sustained oscillations can be observed. These oscillations are induced by an
amplification mechanism similar to that observed in population dynamics [1] and in other
biological systems, and may have significant amplitudes impeding the convergence to the Nash
equilibrium. We expect this to be the case for a variety of different games and learning
algorithms [14], with compelling consequences for the learnability of games and their Nash
equilibria. Deterministic learning of asymmetric games is known to lead to chaotic motion [10],
and we expect that a dynamics with imperfect sampling would make it even less likely that the
players collectively retrieve a Nash equilibrium.



The author thanks J. D. Farmer for discussions, 
and Research Councils UK for financial support. 



[1] A. J. McKane and T. J. Newman, Phys. Rev. Lett. 94, 218102 (2005).
[2] J. P. Aparicio, H. G. Solari, Phys. Rev. Lett. 86, 4183 (2001); D. Alonso, A. J. McKane,
    M. Pascual, J. Roy. Soc. Interface 4, 575 (2007); M. Simoes, M. M. Telo da Gama, A. Nunes,
    J. Roy. Soc. Interface 5, 555 (2008).
[3] A. J. McKane, J. D. Nagy, T. J. Newman and M. O. Stefanini, J. Stat. Phys. 128, 165-191
    (2007).
[4] A. Traulsen, J. C. Claussen, C. Hauert, Phys. Rev. Lett. 95, 238701 (2005); J. Cremer,
    T. Reichenbach, E. Frey, Eur. Phys. J. B 63, 373 (2008); L. A. Imhof, D. Fudenberg,
    M. A. Nowak, Proc. Nat. Acad. Sci. 102, 10797 (2005); A. Traulsen, C. Hauert, preprint
    arXiv:0811.3538; A. Traulsen, J. M. Pacheco, L. A. Imhof, Phys. Rev. E 74, 021905 (2006);
    J. C. Claussen, A. Traulsen, Phys. Rev. Lett. 100, 058104 (2008).
[5] D. Fudenberg, D. K. Levine, The theory of learning in games (MIT Press, Cambridge Mass.,
    1998); F. Vega-Redondo, Economics and the theory of games (Cambridge Univ. Press,
    Cambridge UK, 2003).
[6] J. v. Neumann, O. Morgenstern, Theory of games and economic behavior (Princeton Univ.
    Press, 1953).
[7] J. Maynard Smith, G. Price, Nature 246, 15 (1973); J. Maynard Smith, Evolution and the
    theory of games (Cambridge University Press, 1998).
[8] J. Henrich, R. Boyd, S. Bowles, C. Camerer, E. Fehr and H. Gintis (Eds), Foundations of
    Human Sociality (Oxford University Press, Oxford UK, 2004).
[9] T. H. Ho, C. F. Camerer, J.-K. Chong, J. Econ. Theory 133, 177 (2007).
[10] Y. Sato, J. P. Crutchfield, Phys. Rev. E 67, 015206(R) (2003); Y. Sato, E. Akiyama,
    J. D. Farmer, Proc. Nat. Acad. Sci. USA 99, 4848 (2002).
[11] E. Ahmed, A. S. Hegazi, A. S. Elgazzar, Int. J. Mod. Phys. C 14, 963 (2003).
[12] D. Saad (Ed.), On-line learning in neural networks (Cambridge University Press,
    Cambridge UK, 1998).
[13] N. G. van Kampen, Stochastic processes in physics and chemistry (Elsevier Science,
    Amsterdam, 1992).
[14] T. Galla (forthcoming, 2009).
[15] H. Gintis, Game theory evolving (Princeton Univ. Press, Princeton NJ, 2000).


